
Introduction

This project compares execution time and memory use in Python and R for computational operations common in data science: generic loops and vectorized operations, matrix multiplication and inversion, and some popular, computationally heavy statistical and machine learning algorithms.

Why R and Python

R and Python are arguably the two most popular open-source programming languages used in data science. Python is also widely used for programming outside data science and ranks first in many popularity indices like the TIOBE index. Its popularity in data science is mainly due to its extensive ecosystem of libraries contributed by the wider programming community, such as NumPy, pandas, and SciPy, which provide tools for data manipulation, analysis, and visualization. Python’s ecosystem also reaches beyond data science, with libraries for web development like Django and Flask and for natural language processing like NLTK and spaCy. Python has a user-friendly syntax and extensive documentation, and it is particularly popular in the data science industry and in the academic machine learning community. Python libraries are typically built in C or C++ and provide efficient implementations of most popular data science algorithms while keeping the code simple and readable.

R is a programming language designed specifically for statistical computing and graphics. Like Python, R’s main strength is its extensive collection of packages contributed by its community of users, which mainly consists of statisticians and researchers. It features built-in functions for popular statistical algorithms like linear regression as well as flexible graphical capabilities, and packages like ggplot2 and plotly provide many high-quality visualization tools. R’s syntax is tailored for statistical analysis: vectors are its basic data structure for storing and manipulating data, and it includes many built-in reproducibility features. Its packages are also typically implemented in C or C++.

Other popular high-level programming languages in data science include Java and the relative newcomer Julia, and common proprietary programming languages include SAS, Stata, and MATLAB.

Setup

For each algorithm, I measure execution time and memory use using the benchmark package in R and the time and profile libraries in Python. I tried to write the algorithms so that they are as structurally similar as possible in both languages, but the implementations of base functions vary greatly between the two. Ultimately, this is what makes the comparison interesting, since similar operations can differ greatly in execution time. It also means that an algorithm can be poorly optimized in one language, which makes the other language look artificially faster. For instance, people often criticize the speed of loops in R, and my tests agree with this criticism, but these operations can almost always be made much faster by using vectorized operations.
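As an illustration of the kind of measurement described above, here is a minimal Python sketch. It assumes `time.perf_counter` for wall-clock time and the standard-library `tracemalloc` module for peak memory, which may differ from the exact calls used in the project scripts. The helper `measure` and the function `loop_sum_of_squares` are hypothetical names introduced only for this example.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args) once; return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    seconds = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes since start()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak_bytes


def loop_sum_of_squares(n):
    """Explicit loop: add up i^2 for i = 0, ..., n - 1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, seconds, peak_bytes = measure(loop_sum_of_squares, 100_000)
print(f"result={result}, time={seconds:.4f}s, peak={peak_bytes} bytes")
```

Tracing allocations slows the code down, so a more careful benchmark would probably time and profile memory in separate runs and repeat each measurement several times.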

I simulated five datasets in R, typically from a standard Normal distribution, with sample sizes varying from \(10^3\) to \(10^7\). I ran scripts in R and Python on these simulated datasets and recorded their execution time and memory use in CSV files. I used R Markdown to generate this webpage and the plotly package to generate the interactive plots. I also included the theoretical Big-O complexity of each algorithm, based on its simplest form.
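The workflow of simulating data, timing an operation, and recording the result as CSV could look roughly like the following Python sketch. Everything specific here is an assumption made for illustration: the column names in `FIELDS`, the `loop_mean` label, the helper `time_loop_mean`, and drawing the samples in Python rather than reading the datasets simulated in R. The sketch writes to an in-memory buffer and stops at \(10^5\) to stay fast.

```python
import csv
import io
import random
import time

random.seed(1)  # fixed seed so reruns draw the same samples

# Hypothetical column names; the layout of the project's CSV files is not shown.
FIELDS = ["language", "algorithm", "sample_size", "seconds"]


def time_loop_mean(n):
    """Draw n standard Normal values and time a plain-loop mean."""
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    total = 0.0
    for x in sample:
        total += x
    mean = total / n
    seconds = time.perf_counter() - start
    return mean, seconds


# In-memory stand-in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
for exponent in range(3, 6):  # sizes 10^3, 10^4, 10^5
    n = 10 ** exponent
    _, seconds = time_loop_mean(n)
    writer.writerow({"language": "Python", "algorithm": "loop_mean",
                     "sample_size": n, "seconds": seconds})

# Read the rows back to check what was recorded
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0])
```

Keeping one row per language, algorithm, and sample size gives a long-format table that a plotting library can group by language directly.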

All code for the scripts is available in the GitHub repository and all code used to generate my website using the theme … is available in the GitHub repository …

The following algorithms were tested: a simple loop and a vectorized implementation, matrix multiplication and inversion, linear regression, a bootstrap algorithm, and an SVM algorithm. When possible, I have implemented both a simple version of each algorithm using only base functions in the language and a version using a popular package or library. For instance, for linear regression I use the lm() function in R and the scikit-learn library in Python, as well as an algorithm using only matrix multiplication and inversion. Of course, these two matrix operations are themselves algorithms with many possible implementations, which I am not familiar with and which are apparently active areas of research in computer science. I am also somewhat biased, since I mainly use R in my academic work and am more familiar with the most optimized packages in R than with those in Python.
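For reference, a linear regression that uses only matrix multiplication and inversion is usually written with the normal equations. The notation below is my own assumption rather than taken from the scripts: \(X\) is the \(n \times p\) design matrix and \(y\) is the response vector of length \(n\).

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y
\]

With standard textbook algorithms, forming \(X^\top X\) costs \(O(np^2)\) operations and inverting the resulting \(p \times p\) matrix costs \(O(p^3)\), so for a fixed number of predictors the cost grows linearly with the sample size \(n\).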

Comparison

Loop operations

This is fake data.

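library(plotly)

# Placeholder timings (not measurements) with different growth rates
sample_sizes <- c(100, 1000, 10000, 100000)
python_times <- sample_sizes / 100
r_times <- sqrt(sample_sizes)
julia_times <- log(sample_sizes)

# Long format: one row per (language, sample size) pair.
# Time is stacked language by language, so SampleSize repeats the whole
# vector once per language and Language repeats each label once per size.
data <- data.frame(
  SampleSize = rep(sample_sizes, times = 3),
  Time = c(python_times, r_times, julia_times),
  Language = factor(rep(c("Python", "R", "Julia"), each = length(sample_sizes)))
)

# Interactive line plot with one trace per language
p <- plot_ly(data, x = ~SampleSize, y = ~Time, color = ~Language,
             type = 'scatter', mode = 'lines+markers')
p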

Discussion

Speed depending on the machine

What does memory usage really mean and how can we measure it?

Other considerations than speed and memory

My experience using interactive visualization and a website

References